Adrian Marino

Claudio Collado

Inicialización

Fijamos la semilla para poder reproducir los experimentos. También se fija el numero de CPU’s a utilizar.

set.seed(42)
options(mc.cores = 24)

Librerias

Se importan las librerías a utilizar a lo largo de la notebook:

# install.packages(pacman)
# install.packages("https://cran.r-project.org/src/contrib/rstan_2.21.2.tar.gz",repos = NULL,type="source")
# sudo apt-get install libglpk-dev
library(pacman)
p_load(tidyverse, tidymodels, rsample, rstan, shinystan, rstanarm, devtools)

source('../src/dataset.R')
source('../src/plot.R')
source('../src/model.R')

Dataset y Analisis Exploratorio

Palmer Penguins

Palmer Penguins

1. Lectura del dataset

dataset <- load_dataset() %>% mutate_if(is.character, as.factor)
Rows: 344 Columns: 8
── Column specification ───────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dataset %>% glimpse()
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie,…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, To…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, 37.8, 41.1, 38.…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17.1, 17.3, 17.6, 21.…
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180, 182, 191, 198, 185, …
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, 3300, 3700, 3200, 380…
$ sex               <fct> male, female, female, NA, female, male, female, male, NA, NA, NA, NA, female, m…
$ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2…

2. Variables

Se enumeran y describen breve-mente cada variable que forma parte del dataset:

Variables Numéricas:

  • bill_length_mm: Longitud del pico del individuo medida en milímetros (también conocida como longitud del culmen).
  • bill_depth_mm: Profundidad del pico del individuo medida en milímetros (también conocida como profundidad del culmen).
  • flipper_length_mm: Longitud de la aleta del individuo medida en milímetros.
  • body_mass_g: Masa corporal del individuo medida en gramos.
  • year: Año en el que se registra el individuo.

Variables Categóricas:

  • species: Especie del individuo (Adelie, Gentoo ;) o Chinstrap).
  • sex: Sexo del individuo.
  • island: Isla donde se encontré el individuo (Biscoe, Dream o Torgersen).

3. Resumen de faltantes

missings_summary(dataset)

4. Varibles numericas

hist_plots(dataset)

Observaciones

  • Se aprecia que cada año se registro prácticamente el mismo numero de individuos.
  • La distribución de la masa corporal de los individuos tiene una asimétrica positiva. Tenemos muchos individuos con valores bajos de masa corporal, con una media de 192 gramos. Luego tenemos menos individuos con valores mas alto.
  • La longitud de la aleta parece ser una distribución bi-modal. Tenemos dos modas una 192 mm y otra en 215 mm.
  • La longitud del pico también parece tener una ligera simetría positiva. Es decir que lo individuos con menor peso tiene pico mas pequeños.
  • Por otro lado la profundidad de pico parece tener una ligera simetría positiva.
box_plots(dataset)

Observaciones

  • COMPLETAR

Outliers

No se registran valores mas extremos que el mínimo y máximo valor en cada variables. Es decir que no encontramos outliers.

outliers(dataset, column='bill_length_mm')
$inf
[1] 32.1

$sup
[1] 59.6

outliers(dataset, column='bill_length_mm')
$inf
[1] 32.1

$sup
[1] 59.6

outliers(dataset, column='bill_depth_mm')
$inf
[1] 13.1

$sup
[1] 21.5

outliers(dataset, column='flipper_length_mm')
$inf
[1] 172

$sup
[1] 231

outliers(dataset, column='body_mass_g')
$inf
[1] 2700

$sup
[1] 6300

outliers(dataset, column='year')
$inf
[1] 2007

$sup
[1] 2009

bar_plots(dataset)

Observaciones

  • La variable sexo se encuentra balanceada. Por otro lado, contiene algunos valores faltantes.
  • La variable island esta completamente desbalanceada. Esto seguramente se debe a una diferencia en numero en las poblaciones en cada isla o a un sesgo al momento de registrar los individuos. Es decir que registramos con individuos en una isla que en otra.
  • Lo mismo sucede con las especies de individuos. Vemos un gran desbalance entre la especie Chinstrap vs. otra especies. Por otro aldo Adelie y Gentoo tiene un conteo mas cercano

5. Excluir observaciones con missings

dataset <- dataset %>% drop_na()
missings_summary(dataset)
Warning: attributes are not identical across measure variables;
they will be dropped

6. Correlaciones

corr_plot(dataset %>% dplyr::select(-year))

segmented_pairs_plot(dataset, segment_column='species')
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Experimentos

Experimento 1

  • Solo variables numéricas.
  • Regresión múltiple frecuentista.
  • Regresión múltiple bayesiana con priors normales y exponencial.

1. Split train vs. test

train_test <- train_test_split(dataset, train_size = 0.7, shuffle = TRUE)
[1] "Train set size: 233"
[1] "Test set size: 100"
train_set <- train_test[[1]]
test_set  <- train_test[[2]]

2. Modelo lineal

lineal_model_1 <- lm(
  body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
  data = train_set
)

summary(lineal_model_1)

Call:
lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm, 
    data = train_set)

Residuals:
    Min      1Q  Median      3Q     Max 
-868.52 -260.19  -17.83  239.28 1269.72 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -6074.063    654.637  -9.279   <2e-16 ***
bill_length_mm        3.530      6.297   0.561    0.576    
bill_depth_mm        12.534     16.232   0.772    0.441    
flipper_length_mm    49.318      2.874  17.157   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 382.4 on 229 degrees of freedom
Multiple R-squared:  0.7755,    Adjusted R-squared:  0.7725 
F-statistic: 263.6 on 3 and 229 DF,  p-value: < 2.2e-16

Quitamos los coeficientes que no son significativos para explicar a la variable body_mass_g:

lineal_model_1 <- lm(body_mass_g ~ flipper_length_mm, data = train_set)

summary(lineal_model_1)

Call:
lm(formula = body_mass_g ~ flipper_length_mm, data = train_set)

Residuals:
    Min      1Q  Median      3Q     Max 
-857.93 -257.17  -17.05  232.95 1278.39 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -5688.500    353.398  -16.10   <2e-16 ***
flipper_length_mm    49.240      1.749   28.16   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 381.7 on 231 degrees of freedom
Multiple R-squared:  0.7743,    Adjusted R-squared:  0.7734 
F-statistic: 792.7 on 1 and 231 DF,  p-value: < 2.2e-16

3. Modelo bayesiano

bayesion_model_1 <- stan(
  model_code =  "
    data {
      int<lower=1>               obs_count;
      vector<lower=1>[obs_count] x;
      vector[obs_count]          y;
    }
    parameters {
      real          beta0;
      real          beta1;
      real<lower=0> sigma;
    }
    model {
      // Distribuciones a priori (Fijado por el investigador)
      beta0 ~ normal(0, 8000);  // Intercept
      beta1 ~ normal(0, 100);   // flipper_length_mm
      sigma ~ exponential(0.1); // Varianza

      // Likelihood
      y ~ normal(beta0 + beta1 * x, sigma);
    }
  ",
  data = list(
      obs_count = nrow(train_set),
      x         = colvalues(train_set, 'flipper_length_mm'),
      y         = colvalues(train_set, 'body_mass_g')
  ),
  chains = 3,
  iter   = 300,
  warmup = 180,
  thin   = 1
)
starting worker pid=22509 on localhost:11799 at 23:30:25.914
starting worker pid=22546 on localhost:11799 at 23:30:26.021
starting worker pid=22583 on localhost:11799 at 23:30:26.130

SAMPLING FOR MODEL 'c017d4fe4edb4b602fa33b4a46efc8cf' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 2e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.2 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:   1 / 300 [  0%]  (Warmup)
Chain 1: Iteration:  30 / 300 [ 10%]  (Warmup)
Chain 1: Iteration:  60 / 300 [ 20%]  (Warmup)
Chain 1: Iteration:  90 / 300 [ 30%]  (Warmup)
Chain 1: Iteration: 120 / 300 [ 40%]  (Warmup)
Chain 1: Iteration: 150 / 300 [ 50%]  (Warmup)
Chain 1: Iteration: 180 / 300 [ 60%]  (Warmup)
Chain 1: Iteration: 181 / 300 [ 60%]  (Sampling)
Chain 1: Iteration: 210 / 300 [ 70%]  (Sampling)
Chain 1: Iteration: 240 / 300 [ 80%]  (Sampling)
Chain 1: Iteration: 270 / 300 [ 90%]  (Sampling)
Chain 1: Iteration: 300 / 300 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.127886 seconds (Warm-up)
Chain 1:                0.087164 seconds (Sampling)
Chain 1:                0.21505 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'c017d4fe4edb4b602fa33b4a46efc8cf' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 1.6e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.16 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:   1 / 300 [  0%]  (Warmup)
Chain 2: Iteration:  30 / 300 [ 10%]  (Warmup)
Chain 2: Iteration:  60 / 300 [ 20%]  (Warmup)
Chain 2: Iteration:  90 / 300 [ 30%]  (Warmup)
Chain 2: Iteration: 120 / 300 [ 40%]  (Warmup)
Chain 2: Iteration: 150 / 300 [ 50%]  (Warmup)
Chain 2: Iteration: 180 / 300 [ 60%]  (Warmup)
Chain 2: Iteration: 181 / 300 [ 60%]  (Sampling)
Chain 2: Iteration: 210 / 300 [ 70%]  (Sampling)
Chain 2: Iteration: 240 / 300 [ 80%]  (Sampling)
Chain 2: Iteration: 270 / 300 [ 90%]  (Sampling)
Chain 2: Iteration: 300 / 300 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.13672 seconds (Warm-up)
Chain 2:                0.075576 seconds (Sampling)
Chain 2:                0.212296 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'c017d4fe4edb4b602fa33b4a46efc8cf' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 2e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.2 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:   1 / 300 [  0%]  (Warmup)
Chain 3: Iteration:  30 / 300 [ 10%]  (Warmup)
Chain 3: Iteration:  60 / 300 [ 20%]  (Warmup)
Chain 3: Iteration:  90 / 300 [ 30%]  (Warmup)
Chain 3: Iteration: 120 / 300 [ 40%]  (Warmup)
Chain 3: Iteration: 150 / 300 [ 50%]  (Warmup)
Chain 3: Iteration: 180 / 300 [ 60%]  (Warmup)
Chain 3: Iteration: 181 / 300 [ 60%]  (Sampling)
Chain 3: Iteration: 210 / 300 [ 70%]  (Sampling)
Chain 3: Iteration: 240 / 300 [ 80%]  (Sampling)
Chain 3: Iteration: 270 / 300 [ 90%]  (Sampling)
Chain 3: Iteration: 300 / 300 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 0.157134 seconds (Warm-up)
Chain 3:                0.067661 seconds (Sampling)
Chain 3:                0.224795 seconds (Total)
Chain 3: 
params_1 <- c('beta0', 'beta1', 'sigma')
traceplot(bayesion_model_1, pars = params_1, inc_warmup = TRUE)

4. Coeficientes

lm_vs_br_coeficients(lineal_model_1, bayesion_model_1, params_1)

4. Validación

vars_1 <- c('flipper_length_mm') 

lm_vs_br_models_validation(lineal_model_1, bayesion_model_1, params_1, vars_1, test_set)
bayesion_predictor_1 <- BayesianRegressionPredictor.from(bayesion_model_1, params_1, vars_1)

plot_compare_fit(
  lineal_model_1, 
  bayesion_predictor_1, 
  train_set,
  label_1 = 'Regresion Lineal', 
  label_2 = 'Regresion Bayesiana'
)

Experimento 2

  • Idem a experimento 1, incorporando una variable categórica.
  • Regresión múltiple frecuentista.
  • Regresión bayesiana con priors normales y exponencial.

1. Modelo lineal

lineal_model_2 <- lm(
    body_mass_g 
      ~ bill_length_mm
      + bill_depth_mm
      + flipper_length_mm
      + sex,

  data = train_set
)

summary(lineal_model_2)

Call:
lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + 
    sex, data = train_set)

Residuals:
    Min      1Q  Median      3Q     Max 
-755.02 -255.89   -6.03  221.97  989.38 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1922.6152   729.5654  -2.635  0.00898 ** 
bill_length_mm       -0.5331     5.4439  -0.098  0.92208    
bill_depth_mm       -92.9217    18.2692  -5.086 7.62e-07 ***
flipper_length_mm    37.2016     2.8209  13.188  < 2e-16 ***
sexmale             534.1583    59.5409   8.971  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 329.5 on 228 degrees of freedom
Multiple R-squared:  0.834, Adjusted R-squared:  0.8311 
F-statistic: 286.5 on 4 and 228 DF,  p-value: < 2.2e-16

Quitamos los coeficientes que no son significativos para explicar a la variable body_mass_g:

lineal_model_2 <- lm(
    body_mass_g 
      ~ bill_depth_mm
      + flipper_length_mm
      + sex,
  data = train_set
)

summary(lineal_model_2)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + flipper_length_mm + 
    sex, data = train_set)

Residuals:
    Min      1Q  Median      3Q     Max 
-758.30 -258.55   -5.04  223.02  989.59 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1912.697    720.936  -2.653  0.00853 ** 
bill_depth_mm       -93.106     18.133  -5.135 6.04e-07 ***
flipper_length_mm    37.053      2.373  15.617  < 2e-16 ***
sexmale             533.673     59.206   9.014  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 328.7 on 229 degrees of freedom
Multiple R-squared:  0.834, Adjusted R-squared:  0.8319 
F-statistic: 383.6 on 3 and 229 DF,  p-value: < 2.2e-16

2. Modelo bayesiano

Construimos una matriz con todas las variables(X) mas el intercept:

to_model_input <- function(df) {
  model.matrix(
    body_mass_g 
      ~ bill_depth_mm
      + flipper_length_mm
      + sex,
    data = df
  )
}

train_X <- train_set %>% to_model_input()
test_X  <- test_set %>% to_model_input()

head(train_X)
  (Intercept) bill_depth_mm flipper_length_mm sexmale
1           1          18.8               197       1
2           1          14.6               211       0
3           1          19.1               195       1
4           1          15.9               224       1
5           1          18.5               201       1
6           1          18.3               195       1

Definimos y corremos el modelo bayesiano:

bayesion_model_2 <- stan(
  model_code = "
    data {
      int<lower=1>                 obs_count;
      int<lower=1>                 coef_count;
      matrix[obs_count,coef_count] X;
      vector[obs_count]            y;
    }
    parameters {
      vector[coef_count]  beta;
      real<lower=0>       sigma;
    }
    model {
      // Distribuciones a priori
      beta[1] ~ normal(0, 3000); // Intecept = beta0 + sexfemale
      beta[2] ~ normal(0, 200);  // bill_depth_mm
      beta[3] ~ normal(0, 200);  // flipper_length_mm
      beta[4] ~ normal(0, 600);  // sexmale
  
      sigma ~ exponential(0.1);  // Varianza
  
      // Likelihood
      y ~ normal(X * beta, sigma);
    }
  ",
  data = list(
      obs_count  = dim(train_X)[1],
      coef_count = dim(train_X)[2],
      y          = colvalues(train_set, 'body_mass_g'),
      X          = train_X
  ),
  chains = 3,
  iter   = 4000,
  warmup = 500,
  thin   = 1
)
starting worker pid=22865 on localhost:11799 at 23:31:34.101
starting worker pid=22902 on localhost:11799 at 23:31:34.210
starting worker pid=22939 on localhost:11799 at 23:31:34.320

SAMPLING FOR MODEL '2aabf6de2b4d6f89278592cbff82547b' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 1.7e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.17 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 4000 [  0%]  (Warmup)
Chain 1: Iteration:  400 / 4000 [ 10%]  (Warmup)
Chain 1: Iteration:  501 / 4000 [ 12%]  (Sampling)
Chain 1: Iteration:  900 / 4000 [ 22%]  (Sampling)
Chain 1: Iteration: 1300 / 4000 [ 32%]  (Sampling)

SAMPLING FOR MODEL '2aabf6de2b4d6f89278592cbff82547b' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 1.9e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.19 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 4000 [  0%]  (Warmup)
Chain 1: Iteration: 1700 / 4000 [ 42%]  (Sampling)

SAMPLING FOR MODEL '2aabf6de2b4d6f89278592cbff82547b' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 1.6e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.16 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 4000 [  0%]  (Warmup)
Chain 1: Iteration: 2100 / 4000 [ 52%]  (Sampling)
Chain 2: Iteration:  400 / 4000 [ 10%]  (Warmup)
Chain 2: Iteration:  501 / 4000 [ 12%]  (Sampling)
Chain 1: Iteration: 2500 / 4000 [ 62%]  (Sampling)
Chain 3: Iteration:  400 / 4000 [ 10%]  (Warmup)
Chain 3: Iteration:  501 / 4000 [ 12%]  (Sampling)
Chain 1: Iteration: 2900 / 4000 [ 72%]  (Sampling)
Chain 2: Iteration:  900 / 4000 [ 22%]  (Sampling)
Chain 1: Iteration: 3300 / 4000 [ 82%]  (Sampling)
Chain 3: Iteration:  900 / 4000 [ 22%]  (Sampling)
Chain 2: Iteration: 1300 / 4000 [ 32%]  (Sampling)
Chain 1: Iteration: 3700 / 4000 [ 92%]  (Sampling)
Chain 3: Iteration: 1300 / 4000 [ 32%]  (Sampling)
Chain 2: Iteration: 1700 / 4000 [ 42%]  (Sampling)
Chain 1: Iteration: 4000 / 4000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.502712 seconds (Warm-up)
Chain 1:                2.31541 seconds (Sampling)
Chain 1:                2.81812 seconds (Total)
Chain 1: 
Chain 2: Iteration: 2100 / 4000 [ 52%]  (Sampling)
Chain 3: Iteration: 1700 / 4000 [ 42%]  (Sampling)
Chain 2: Iteration: 2500 / 4000 [ 62%]  (Sampling)
Chain 3: Iteration: 2100 / 4000 [ 52%]  (Sampling)
Chain 2: Iteration: 2900 / 4000 [ 72%]  (Sampling)
Chain 3: Iteration: 2500 / 4000 [ 62%]  (Sampling)
Chain 2: Iteration: 3300 / 4000 [ 82%]  (Sampling)
Chain 3: Iteration: 2900 / 4000 [ 72%]  (Sampling)
Chain 2: Iteration: 3700 / 4000 [ 92%]  (Sampling)
Chain 3: Iteration: 3300 / 4000 [ 82%]  (Sampling)
Chain 2: Iteration: 4000 / 4000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.622863 seconds (Warm-up)
Chain 2:                2.7078 seconds (Sampling)
Chain 2:                3.33067 seconds (Total)
Chain 2: 
Chain 3: Iteration: 3700 / 4000 [ 92%]  (Sampling)
Chain 3: Iteration: 4000 / 4000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 0.507968 seconds (Warm-up)
Chain 3:                2.8828 seconds (Sampling)
Chain 3:                3.39077 seconds (Total)
Chain 3: 
params_2 <- c('beta[1]', 'beta[2]', 'beta[3]', 'beta[4]', 'sigma')
traceplot(bayesion_model_2, inc_warmup = TRUE, pars = params_2)

3. Coeficientes

lineal_model_2$coefficients
      (Intercept)     bill_depth_mm flipper_length_mm           sexmale 
      -1912.69656         -93.10594          37.05300         533.67323 
br_coeficients(bayesion_model_2, params_2)

4. Validación

vars_2 <- c('bill_depth_mm', 'flipper_length_mm', 'sexmale')

lm_vs_br_models_validation(
  lineal_model_2, 
  bayesion_model_2, 
  params_2,
  vars_2,
  test_set,
  as.data.frame(test_X)
)
bayesion_predictor_2 <- BayesianRegressionPredictor.from(bayesion_model_2, params_2, vars_2)

plot_compare_fit(
  lineal_model_2, 
  bayesion_predictor_2, 
  test_set, 
  as.data.frame(test_X),
  label_1 = 'Regresion Lineal', 
  label_2 = 'Regresion Bayesiana'
)

Experimento 3

  • Idem a experimento 1 incorporando outliers en alguna variable numérica.
  • Regresión múltiple frecuentista.
  • Regresión bayesiana con priors normales y exponencial.

1. Outliers

A continuación vamos a agregar outliers a la variable flipper_length_mm, la cual define la longitud de la aleta del individuo medida en milímetros.

Para visualizar los nuevos outliers a continuación graficamos flipper_length_mm vs. body_mass_g antes y después de agregar outliers:

plot_data(train_set)

train_set_with_outliers <- train_set %>%
  mutate(flipper_length_mm = ifelse(
    body_mass_g > 5400 & body_mass_g < 5700, 
    flipper_length_mm + (flipper_length_mm * runif(1, 0.1, 0.2)), 
    flipper_length_mm
  ))

plot_data(train_set_with_outliers)

2. Modelo lineal

lineal_model_3 <- lm(
  body_mass_g ~ flipper_length_mm,
  data = train_set_with_outliers
)
summary(lineal_model_3)

Call:
lm(formula = body_mass_g ~ flipper_length_mm, data = train_set_with_outliers)

Residuals:
   Min     1Q Median     3Q    Max 
-797.5 -301.7   -9.2  281.6 1165.8 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -3722.454    305.332  -12.19   <2e-16 ***
flipper_length_mm    39.167      1.497   26.16   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 403.6 on 231 degrees of freedom
Multiple R-squared:  0.7477,    Adjusted R-squared:  0.7466 
F-statistic: 684.6 on 1 and 231 DF,  p-value: < 2.2e-16

Comparemos el ajuste del modelo lineal ajustando en un dataset de train con y sin outliers:

plot_compare_fit(
  lineal_model_1, 
  lineal_model_3, 
  train_set_with_outliers,
  label_1='Regresión Lineal SIN outliers', 
  label_2='Regresión Lineal CON outliters'
)

Se aprecia que el modelo entrenado en train_set_outliers esta apalancado por las observaciones atípicas

2. Modelo lineal Robusto

Entrenamos una regresión lineal múltiple robusta para intentar de disminuir el efecto de los nuevos outliers:

robust_lineal_model_3 <- rlm(
  body_mass_g ~ flipper_length_mm,
  data = train_set_with_outliers
)
plot_compare_fit(
  lineal_model_1, 
  robust_lineal_model_3, 
  train_set_with_outliers,
  label_1='Regresión Lineal SIN outliers', 
  label_2='Regresión Lineal Robusta CON outliters'
)

Gráficamente no se llega a distinguir pero el modelo robusto termina ajustando mejor que el modelo lineal clásico. Esto se puede observar cuando comparamos el RMSE/MAE sobre train y test.

lm_vs_lm_models_validation(lineal_model_3, robust_lineal_model_3, train_set_with_outliers)
Warning in actual - predicted :
  longer object length is not a multiple of shorter object length
Warning in actual - predicted :
  longer object length is not a multiple of shorter object length
Warning in actual - predicted :
  longer object length is not a multiple of shorter object length
Warning in actual - predicted :
  longer object length is not a multiple of shorter object length
lm_vs_lm_models_validation(lineal_model_3, robust_lineal_model_3, test_set)

3. Modelo bayesiano con Likelihood normal

bayesion_model_3 <- stan(
  model_code =  "
    data {
      int<lower=1>               obs_count;
      vector<lower=1>[obs_count] x;
      vector[obs_count]          y;
    }
    parameters {
      real          beta0;
      real          beta1;
      real<lower=0> sigma;
    }
    model {
      // Distribuciones a priori (Fijado por el investigador)
      beta0 ~ normal(-6000, 500);  // Intercept (Muy informativa!!!)
      beta1 ~ normal(0, 100);      // flipper_length_mm (Muy informativa!!!)
      sigma ~ exponential(0.1);    // Varianza

      // Likelihood
      y ~ normal(beta0 + beta1 * x, sigma);
    }
  ",
  data = list(
      obs_count = nrow(train_set_with_outliers),
      x = colvalues(train_set_with_outliers, 'flipper_length_mm'),
      y = colvalues(train_set_with_outliers, 'body_mass_g')
  ),
  chains = 3,
  iter   = 1000,
  warmup = 300,
  thin   = 1
)
starting worker pid=29790 on localhost:11799 at 00:03:56.670
starting worker pid=29827 on localhost:11799 at 00:03:56.780
starting worker pid=29864 on localhost:11799 at 00:03:56.889

SAMPLING FOR MODEL '4c854be188b76190ba259392f0cefe60' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 1.7e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.17 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:   1 / 1000 [  0%]  (Warmup)
Chain 1: Iteration: 100 / 1000 [ 10%]  (Warmup)
Chain 1: Iteration: 200 / 1000 [ 20%]  (Warmup)
Chain 1: Iteration: 300 / 1000 [ 30%]  (Warmup)
Chain 1: Iteration: 301 / 1000 [ 30%]  (Sampling)
Chain 1: Iteration: 400 / 1000 [ 40%]  (Sampling)
Chain 1: Iteration: 500 / 1000 [ 50%]  (Sampling)
Chain 1: Iteration: 600 / 1000 [ 60%]  (Sampling)
Chain 1: Iteration: 700 / 1000 [ 70%]  (Sampling)
Chain 1: Iteration: 800 / 1000 [ 80%]  (Sampling)
Chain 1: Iteration: 900 / 1000 [ 90%]  (Sampling)
Chain 1: Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.242403 seconds (Warm-up)
Chain 1:                0.204858 seconds (Sampling)
Chain 1:                0.447261 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL '4c854be188b76190ba259392f0cefe60' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 1.8e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.18 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:   1 / 1000 [  0%]  (Warmup)
Chain 2: Iteration: 100 / 1000 [ 10%]  (Warmup)
Chain 2: Iteration: 200 / 1000 [ 20%]  (Warmup)
Chain 2: Iteration: 300 / 1000 [ 30%]  (Warmup)
Chain 2: Iteration: 301 / 1000 [ 30%]  (Sampling)
Chain 2: Iteration: 400 / 1000 [ 40%]  (Sampling)
Chain 2: Iteration: 500 / 1000 [ 50%]  (Sampling)
Chain 2: Iteration: 600 / 1000 [ 60%]  (Sampling)
Chain 2: Iteration: 700 / 1000 [ 70%]  (Sampling)
Chain 2: Iteration: 800 / 1000 [ 80%]  (Sampling)
Chain 2: Iteration: 900 / 1000 [ 90%]  (Sampling)
Chain 2: Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.231599 seconds (Warm-up)
Chain 2:                0.248663 seconds (Sampling)
Chain 2:                0.480262 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL '4c854be188b76190ba259392f0cefe60' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 2e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.2 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:   1 / 1000 [  0%]  (Warmup)
Chain 3: Iteration: 100 / 1000 [ 10%]  (Warmup)
Chain 3: Iteration: 200 / 1000 [ 20%]  (Warmup)
Chain 3: Iteration: 300 / 1000 [ 30%]  (Warmup)
Chain 3: Iteration: 301 / 1000 [ 30%]  (Sampling)
Chain 3: Iteration: 400 / 1000 [ 40%]  (Sampling)
Chain 3: Iteration: 500 / 1000 [ 50%]  (Sampling)
Chain 3: Iteration: 600 / 1000 [ 60%]  (Sampling)
Chain 3: Iteration: 700 / 1000 [ 70%]  (Sampling)
Chain 3: Iteration: 800 / 1000 [ 80%]  (Sampling)
Chain 3: Iteration: 900 / 1000 [ 90%]  (Sampling)
Chain 3: Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 0.13834 seconds (Warm-up)
Chain 3:                0.227717 seconds (Sampling)
Chain 3:                0.366057 seconds (Total)
Chain 3: 
params_3 <- c('beta0', 'beta1', 'sigma')
traceplot(bayesion_model_3, pars = params_3, inc_warmup = TRUE)

4. Coeficientes

lm_vs_br_coeficients(robust_lineal_model_3, bayesion_model_3, params_3)

5. Validación

vars_3 <- c('flipper_length_mm') 

lm_vs_br_models_validation(robust_lineal_model_3, bayesion_model_3, params_3, vars_3, test_set)
bayesion_predictor_3 <- BayesianRegressionPredictor.from(bayesion_model_3, params_3, vars_3)

plot_compare_fit(
  robust_lineal_model_3, 
  bayesion_predictor_3, 
  train_set,
  label_1='Regresion Lineal Robusta CON outliers', 
  label_2='Regresion Bayesiana CON outliers'
)

6. Modelo bayesiano con Likelihood t-student

bayesion_model_3.1 <- stan(
  model_code =  "
    data {
      int<lower=1>               obs_count;
      vector<lower=1>[obs_count] x;
      vector[obs_count]          y;
    }
    parameters {
      real          beta0;
      real          beta1;
      real<lower=0> sigma;
    }
    model {
      // Distribuciones a priori (Fijado por el investigador)
      beta0 ~ normal(-5700, 50);  // Intercept (Muy informativa!!!)
      beta1 ~ normal(49, 20);      // flipper_length_mm (Muy informativa!!!)
      sigma ~ exponential(0.1); // Varianza

      // Likelihood
      y ~ student_t(2, beta0 + beta1 * x, sigma);
    }
  ",
  data = list(
      obs_count = nrow(train_set_with_outliers),
      x = colvalues(train_set_with_outliers, 'flipper_length_mm'),
      y = colvalues(train_set_with_outliers, 'body_mass_g')
  ),
  chains = 3,
  iter   = 1000,
  warmup = 300,
  thin   = 1
)
starting worker pid=30911 on localhost:11799 at 00:08:24.510
starting worker pid=30948 on localhost:11799 at 00:08:24.618
starting worker pid=30985 on localhost:11799 at 00:08:24.729

SAMPLING FOR MODEL '1675203fd35ad686882090462fa7c61a' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 2.4e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.24 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:   1 / 1000 [  0%]  (Warmup)
Chain 1: Iteration: 100 / 1000 [ 10%]  (Warmup)
Chain 1: Iteration: 200 / 1000 [ 20%]  (Warmup)
Chain 1: Iteration: 300 / 1000 [ 30%]  (Warmup)
Chain 1: Iteration: 301 / 1000 [ 30%]  (Sampling)
Chain 1: Iteration: 400 / 1000 [ 40%]  (Sampling)
Chain 1: Iteration: 500 / 1000 [ 50%]  (Sampling)
Chain 1: Iteration: 600 / 1000 [ 60%]  (Sampling)
Chain 1: Iteration: 700 / 1000 [ 70%]  (Sampling)
Chain 1: Iteration: 800 / 1000 [ 80%]  (Sampling)
Chain 1: Iteration: 900 / 1000 [ 90%]  (Sampling)
Chain 1: Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.064665 seconds (Warm-up)
Chain 1:                0.099819 seconds (Sampling)
Chain 1:                0.164484 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL '1675203fd35ad686882090462fa7c61a' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 2.1e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.21 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:   1 / 1000 [  0%]  (Warmup)
Chain 2: Iteration: 100 / 1000 [ 10%]  (Warmup)
Chain 2: Iteration: 200 / 1000 [ 20%]  (Warmup)
Chain 2: Iteration: 300 / 1000 [ 30%]  (Warmup)
Chain 2: Iteration: 301 / 1000 [ 30%]  (Sampling)
Chain 2: Iteration: 400 / 1000 [ 40%]  (Sampling)
Chain 2: Iteration: 500 / 1000 [ 50%]  (Sampling)
Chain 2: Iteration: 600 / 1000 [ 60%]  (Sampling)
Chain 2: Iteration: 700 / 1000 [ 70%]  (Sampling)
Chain 2: Iteration: 800 / 1000 [ 80%]  (Sampling)
Chain 2: Iteration: 900 / 1000 [ 90%]  (Sampling)
Chain 2: Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.150916 seconds (Warm-up)
Chain 2:                0.094938 seconds (Sampling)
Chain 2:                0.245854 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL '1675203fd35ad686882090462fa7c61a' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 2.5e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.25 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:   1 / 1000 [  0%]  (Warmup)
Chain 3: Iteration: 100 / 1000 [ 10%]  (Warmup)
Chain 3: Iteration: 200 / 1000 [ 20%]  (Warmup)
Chain 3: Iteration: 300 / 1000 [ 30%]  (Warmup)
Chain 3: Iteration: 301 / 1000 [ 30%]  (Sampling)
Chain 3: Iteration: 400 / 1000 [ 40%]  (Sampling)
Chain 3: Iteration: 500 / 1000 [ 50%]  (Sampling)
Chain 3: Iteration: 600 / 1000 [ 60%]  (Sampling)
Chain 3: Iteration: 700 / 1000 [ 70%]  (Sampling)
Chain 3: Iteration: 800 / 1000 [ 80%]  (Sampling)
Chain 3: Iteration: 900 / 1000 [ 90%]  (Sampling)
Chain 3: Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 0.119479 seconds (Warm-up)
Chain 3:                0.081591 seconds (Sampling)
Chain 3:                0.20107 seconds (Total)
Chain 3: 
params_3 <- c('beta0', 'beta1', 'sigma')
traceplot(bayesion_model_3.1, pars = params_3, inc_warmup = TRUE)

7. Coeficientes

lm_vs_br_coeficients(lineal_model_1, bayesion_model_3.1, params_3)

8. Validación

vars_3 <- c('flipper_length_mm') 

lm_vs_br_models_validation(lineal_model_1, bayesion_model_3.1, params_3, vars_3, test_set)
bayesion_predictor_3.1 <- BayesianRegressionPredictor.from(bayesion_model_3.1, params_3, vars_3)

plot_compare_fit(
  lineal_model_1, 
  bayesion_predictor_3.1, 
  train_set_with_outliers,
  label_1='Regresion Lineal SIN outliers', 
  label_2='Regresion Bayesiana CON outliers'
)

Experimento 4

  • Idem experimento 1 pero reduciendo la cantidad de observaciones a pocos valores (ej:30).
  • Regresión múltiple frecuentista.
  • Regresión bayesiana con priors normales y exponencial.

1. Split train - test

En este aso entrenamos solo con el 10% de lo datos.

train_test <- train_test_split(dataset, train_size = 0.05, shuffle = TRUE)
[1] "Train set size: 16"
[1] "Test set size: 317"
train_set_4 <- train_test[[1]]
test_set_4  <- train_test[[2]]
plot_data(train_set_4)

2. Modelo lineal

lineal_model_4 <- lm(
  body_mass_g ~ flipper_length_mm,
  data = train_set_4
)

3. Modelo bayesiano

bayesion_model_4 <- stan(
  model_code =  "
    data {
      int<lower=1>               obs_count;
      vector<lower=1>[obs_count] x;
      vector[obs_count]          y;
    }
    parameters {
      real          beta0;
      real          beta1;
      real<lower=0> sigma;
    }
    model {
      // Distribuciones a priori
      beta0 ~ normal(0, 8000); // Intercept
      beta1 ~ normal(0, 100);
      sigma ~ exponential(0.1);
    
      // Likelihood
      y ~ normal(beta0 + beta1 * x, sigma);
    }
  ",
  data = list(
      obs_count = nrow(train_set_4),
      x = colvalues(train_set_4, 'flipper_length_mm'),
      y  = colvalues(train_set_4, 'body_mass_g')
  ),
  chains = 3,
  iter   = 300,
  warmup = 180,
  thin   = 1
)
starting worker pid=31466 on localhost:11799 at 00:15:59.730
starting worker pid=31503 on localhost:11799 at 00:15:59.838
starting worker pid=31540 on localhost:11799 at 00:15:59.948

SAMPLING FOR MODEL 'c4d62a81dae11941a9914c2e1cae7316' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 1.2e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.12 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:   1 / 300 [  0%]  (Warmup)
Chain 1: Iteration:  30 / 300 [ 10%]  (Warmup)
Chain 1: Iteration:  60 / 300 [ 20%]  (Warmup)
Chain 1: Iteration:  90 / 300 [ 30%]  (Warmup)
Chain 1: Iteration: 120 / 300 [ 40%]  (Warmup)
Chain 1: Iteration: 150 / 300 [ 50%]  (Warmup)
Chain 1: Iteration: 180 / 300 [ 60%]  (Warmup)
Chain 1: Iteration: 181 / 300 [ 60%]  (Sampling)
Chain 1: Iteration: 210 / 300 [ 70%]  (Sampling)
Chain 1: Iteration: 240 / 300 [ 80%]  (Sampling)
Chain 1: Iteration: 270 / 300 [ 90%]  (Sampling)
Chain 1: Iteration: 300 / 300 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.03853 seconds (Warm-up)
Chain 1:                0.010174 seconds (Sampling)
Chain 1:                0.048704 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'c4d62a81dae11941a9914c2e1cae7316' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 9e-06 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.09 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:   1 / 300 [  0%]  (Warmup)
Chain 2: Iteration:  30 / 300 [ 10%]  (Warmup)
Chain 2: Iteration:  60 / 300 [ 20%]  (Warmup)
Chain 2: Iteration:  90 / 300 [ 30%]  (Warmup)
Chain 2: Iteration: 120 / 300 [ 40%]  (Warmup)
Chain 2: Iteration: 150 / 300 [ 50%]  (Warmup)
Chain 2: Iteration: 180 / 300 [ 60%]  (Warmup)
Chain 2: Iteration: 181 / 300 [ 60%]  (Sampling)
Chain 2: Iteration: 210 / 300 [ 70%]  (Sampling)
Chain 2: Iteration: 240 / 300 [ 80%]  (Sampling)
Chain 2: Iteration: 270 / 300 [ 90%]  (Sampling)
Chain 2: Iteration: 300 / 300 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.051002 seconds (Warm-up)
Chain 2:                0.013484 seconds (Sampling)
Chain 2:                0.064486 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'c4d62a81dae11941a9914c2e1cae7316' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 8e-06 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.08 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:   1 / 300 [  0%]  (Warmup)
Chain 3: Iteration:  30 / 300 [ 10%]  (Warmup)
Chain 3: Iteration:  60 / 300 [ 20%]  (Warmup)
Chain 3: Iteration:  90 / 300 [ 30%]  (Warmup)
Chain 3: Iteration: 120 / 300 [ 40%]  (Warmup)
Chain 3: Iteration: 150 / 300 [ 50%]  (Warmup)
Chain 3: Iteration: 180 / 300 [ 60%]  (Warmup)
Chain 3: Iteration: 181 / 300 [ 60%]  (Sampling)
Chain 3: Iteration: 210 / 300 [ 70%]  (Sampling)
Chain 3: Iteration: 240 / 300 [ 80%]  (Sampling)
Chain 3: Iteration: 270 / 300 [ 90%]  (Sampling)
Chain 3: Iteration: 300 / 300 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 0.054605 seconds (Warm-up)
Chain 3:                0.010649 seconds (Sampling)
Chain 3:                0.065254 seconds (Total)
Chain 3: 
params_4 <- c('beta0', 'beta1', 'sigma')
traceplot(bayesion_model_4, pars = params_4, inc_warmup = TRUE)

4. Coeficientes

Coeficientes de la regresión múltiple:

lineal_model_4$coefficients
      (Intercept) flipper_length_mm 
      -4668.30188          44.28301 

Coeficientes descubiertos por la regresión múltiple bayesiana:

for(param in params_4) print(get_posterior_mean(bayesion_model_4, par=param)[4])
[1] -4464.407
[1] 43.26266
[1] 261.9534

5. Validación

vars_4 <- c('flipper_length_mm') 

lm_vs_br_models_validation(lineal_model_4, bayesion_model_4, params_4, vars_4, test_set)
bayesion_predictor_4 <- BayesianRegressionPredictor.from(bayesion_model_4, params_4, vars_4)

plot_compare_fit(
  lineal_model_4,
  bayesion_predictor_4, 
  train_set,
  label_1='Regresion Lineal', 
  label_2='Regresion Bayesiana'
)

Experimento 5

  • Igual al experimento 1 pero proponiendo dos nuevas regresiones bayesianas con priors para los parámetros que sean:
    • Una poca informativa (uniforme).
    • Una muy informativa (sesgada o con muy poca varianza).
  • Comparar con resultados de la bayesiana del experimento A

1. Modelo bayesiano con parametro con distribucion poco informativa

1. Modelo

Definimos una distribución uniforme para el beta asociado a la variable flipper_length_mm.

starting worker pid=31882 on localhost:11799 at 00:19:15.014
starting worker pid=31919 on localhost:11799 at 00:19:15.123
starting worker pid=31956 on localhost:11799 at 00:19:15.233

SAMPLING FOR MODEL '776b18360e3917d6e2282a204c6c0960' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 1.6e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.16 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:   1 / 1000 [  0%]  (Warmup)
Chain 1: Iteration: 100 / 1000 [ 10%]  (Warmup)
Chain 1: Iteration: 181 / 1000 [ 18%]  (Sampling)
Chain 1: Iteration: 280 / 1000 [ 28%]  (Sampling)
Chain 1: Iteration: 380 / 1000 [ 38%]  (Sampling)
Chain 1: Iteration: 480 / 1000 [ 48%]  (Sampling)
Chain 1: Iteration: 580 / 1000 [ 58%]  (Sampling)
Chain 1: Iteration: 680 / 1000 [ 68%]  (Sampling)
Chain 1: Iteration: 780 / 1000 [ 78%]  (Sampling)
Chain 1: Iteration: 880 / 1000 [ 88%]  (Sampling)
Chain 1: Iteration: 980 / 1000 [ 98%]  (Sampling)
Chain 1: Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.154943 seconds (Warm-up)
Chain 1:                0.349263 seconds (Sampling)
Chain 1:                0.504206 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL '776b18360e3917d6e2282a204c6c0960' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 1.9e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.19 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:   1 / 1000 [  0%]  (Warmup)
Chain 2: Iteration: 100 / 1000 [ 10%]  (Warmup)
Chain 2: Iteration: 181 / 1000 [ 18%]  (Sampling)
Chain 2: Iteration: 280 / 1000 [ 28%]  (Sampling)
Chain 2: Iteration: 380 / 1000 [ 38%]  (Sampling)
Chain 2: Iteration: 480 / 1000 [ 48%]  (Sampling)
Chain 2: Iteration: 580 / 1000 [ 58%]  (Sampling)
Chain 2: Iteration: 680 / 1000 [ 68%]  (Sampling)
Chain 2: Iteration: 780 / 1000 [ 78%]  (Sampling)
Chain 2: Iteration: 880 / 1000 [ 88%]  (Sampling)

SAMPLING FOR MODEL '776b18360e3917d6e2282a204c6c0960' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 1.8e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.18 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:   1 / 1000 [  0%]  (Warmup)
Chain 2: Iteration: 980 / 1000 [ 98%]  (Sampling)
Chain 2: Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.16838 seconds (Warm-up)
Chain 2:                0.362988 seconds (Sampling)
Chain 2:                0.531368 seconds (Total)
Chain 2: 
Chain 3: Iteration: 100 / 1000 [ 10%]  (Warmup)
Chain 3: Iteration: 181 / 1000 [ 18%]  (Sampling)
Chain 3: Iteration: 280 / 1000 [ 28%]  (Sampling)
Chain 3: Iteration: 380 / 1000 [ 38%]  (Sampling)
Chain 3: Iteration: 480 / 1000 [ 48%]  (Sampling)
Chain 3: Iteration: 580 / 1000 [ 58%]  (Sampling)
Chain 3: Iteration: 680 / 1000 [ 68%]  (Sampling)
Chain 3: Iteration: 780 / 1000 [ 78%]  (Sampling)
Chain 3: Iteration: 880 / 1000 [ 88%]  (Sampling)
Chain 3: Iteration: 980 / 1000 [ 98%]  (Sampling)
Chain 3: Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 0.137244 seconds (Warm-up)
Chain 3:                0.31702 seconds (Sampling)
Chain 3:                0.454264 seconds (Total)
Chain 3: 

2. Coeficientes

br_vs_br_coeficients(bayesion_model_1, bayesion_model_5, params_5)

3. Validación

lm_vs_br_models_validation(lineal_model_1, bayesion_model_1, params_1, vars_1, test_set)

4. Validacion

vars_5 <- c('flipper_length_mm') 

lm_vs_br_models_validation(lineal_model_1, bayesion_model_5, params_5, vars_5, test_set)
bayesion_predictor_5 <- BayesianRegressionPredictor.from(bayesion_model_5, params_5, vars_5)

plot_compare_fit(
  bayesion_predictor_1,
  bayesion_predictor_5,
  train_set,
  label_1='Regresion Bayesiana con dist informativa', 
  label_2='Regresion Bayesiana con dist menos informativa'
)

2. Modelo bayesiano con parametro con distribucion muy informativa sesgada o con poca varianza

COMPLETAR

---
title: "Enfoque Estadistico del Aprendizaje - Trabajo Final - Regresion Bayesiana"
fig_width: 3 
fig_height: 3 
output:
  html_document:
    highlight: pygments
    theme: sandstone
    toc: yes
    df_print: paged
    includes:
      before_body: ./header.html
  html_notebook: 
    toc: yes
    toc_float: yes
    df_print: paged
---

### Adrian Marino

### Claudio Collado


# Inicialización

Fijamos la semilla para poder reproducir los experimentos. También se fija el numero de CPU's a utilizar.

```{r}
set.seed(42)
options(mc.cores = 24)
```


# Librerias

Se importan las librerías a utilizar a lo largo de la notebook:

```{r message=FALSE, warning=FALSE}
# install.packages(pacman)
# install.packages("https://cran.r-project.org/src/contrib/rstan_2.21.2.tar.gz",repos = NULL,type="source")
# sudo apt-get install libglpk-dev
```

```{r message=FALSE, warning=FALSE}
library(pacman)
p_load(tidyverse, tidymodels, rsample, rstan, shinystan, rstanarm, devtools)

source('../src/dataset.R')
source('../src/plot.R')
source('../src/model.R')
```

# Dataset y Analisis Exploratorio

![Palmer Penguins](https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/lter_penguins.png)

[Palmer Penguins](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-07-28/readme.md)


## 1. Lectura del dataset


```{r message=FALSE, warning=FALSE}
dataset <- load_dataset() %>% mutate_if(is.character, as.factor)

dataset %>% glimpse()
```


## 2. Variables

Se enumeran y describen breve-mente cada variable que forma parte del dataset:

Variables Numéricas:

* **bill_length_mm**: Longitud del pico del individuo medida en milímetros (también conocida como longitud del culmen).
* **bill_depth_mm**: Profundidad del pico del individuo medida en milímetros (también conocida como profundidad del culmen).
* **flipper_length_mm**: Longitud de la aleta del individuo medida en milímetros.
* **body_mass_g**: Masa corporal del individuo medida en gramos.
* **year**: Año en el que se registra el individuo.

Variables Categóricas:

* **species**: Especie del individuo (Adelie, Gentoo ;) o Chinstrap).
* **sex**: Sexo del individuo.
* **island**: Isla donde se encontré el individuo (Biscoe, Dream o Torgersen).


## 3. Resumen de faltantes

```{r message=FALSE,warning=FALSE}
missings_summary(dataset)
```

## 4. Varibles numericas

```{r fig.height=2, fig.width=3, message=TRUE, warning=FALSE}
hist_plots(dataset)
```


**Observaciones**

* Se aprecia que cada año se registro prácticamente el mismo numero de individuos.
* La distribución de la masa corporal de los individuos tiene una asimétrica positiva. Tenemos muchos individuos con valores bajos de masa corporal, con una media de 192 gramos. Luego tenemos menos individuos con valores mas alto.
* La longitud de la aleta parece ser una distribución bi-modal. Tenemos dos modas una 192 mm y otra en 215 mm.
* La longitud del pico también parece tener una ligera simetría positiva. Es decir que lo individuos con menor peso tiene pico mas pequeños.
* Por otro lado la profundidad de pico parece tener una ligera simetría positiva.


```{r fig.align='center', fig.height=5, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
box_plots(dataset)
```


**Observaciones**

* COMPLETAR


### Outliers

No se registran valores mas extremos que el mínimo y máximo valor en cada variables. Es decir que no encontramos outliers.

```{r, fig.show='hide'}
outliers(dataset, column='bill_length_mm')
```


```{r, fig.show='hide'}
outliers(dataset, column='bill_length_mm')
```


```{r, fig.show='hide'}
outliers(dataset, column='bill_depth_mm')
```

```{r, fig.show='hide'}
outliers(dataset, column='flipper_length_mm')
```

```{r, fig.show='hide'}
outliers(dataset, column='body_mass_g')
```


```{r, fig.show='hide'}
outliers(dataset, column='year')
```


```{r fig.height=3, fig.width=3, warning=FALSE}
bar_plots(dataset)
```

**Observaciones**

* La variable sexo se encuentra balanceada. Por otro lado, contiene algunos valores faltantes.
* La variable island esta completamente desbalanceada. Esto seguramente se debe a una diferencia en numero
en las poblaciones en cada isla o a un sesgo al momento de registrar los individuos. Es decir que registramos con individuos en una isla que en otra.
* Lo mismo sucede con las especies de individuos. Vemos un gran desbalance entre la especie Chinstrap vs. otra especies. Por otro aldo Adelie y Gentoo tiene un conteo mas cercano


## 5. Excluir observaciones con missings

```{r}
dataset <- dataset %>% drop_na()
missings_summary(dataset)
```

## 6. Correlaciones

```{r fig.align='center', fig.height=5, fig.width=5, message=FALSE, warning=FALSE, fig.align='center'}
corr_plot(dataset %>% dplyr::select(-year))
```

```{r fig.align='center', fig.height=12, fig.width=12, message=FALSE, warning=FALSE, fig.align='center'}
segmented_pairs_plot(dataset, segment_column='species')
```

# Experimentos

## Experimento 1

* Solo variables numéricas.
* Regresión múltiple frecuentista.
* Regresión múltiple bayesiana con priors normales y exponencial.


### 1. Split train vs. test


```{r message=FALSE, warning=FALSE}
train_test <- train_test_split(dataset, train_size = 0.7, shuffle = TRUE)
train_set <- train_test[[1]]
test_set  <- train_test[[2]]
```


### 2. Modelo lineal

```{r}
lineal_model_1 <- lm(
  body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
  data = train_set
)

summary(lineal_model_1)
```

Quitamos los coeficientes que no son significativos para explicar a la variable **body_mass_g**:

```{r}
lineal_model_1 <- lm(body_mass_g ~ flipper_length_mm, data = train_set)

summary(lineal_model_1)
```


### 3. Modelo bayesiano

```{r fig.align='center', fig.height=4, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_model_1 <- stan(
  model_code =  "
    data {
      int<lower=1>               obs_count;
      vector<lower=1>[obs_count] x;
      vector[obs_count]          y;
    }
    parameters {
      real          beta0;
      real          beta1;
      real<lower=0> sigma;
    }
    model {
      // Distribuciones a priori (Fijado por el investigador)
      beta0 ~ normal(0, 8000);  // Intercept
      beta1 ~ normal(0, 100);   // flipper_length_mm
      sigma ~ exponential(0.1); // Varianza

      // Likelihood
      y ~ normal(beta0 + beta1 * x, sigma);
    }
  ",
  data = list(
      obs_count = nrow(train_set),
      x         = colvalues(train_set, 'flipper_length_mm'),
      y         = colvalues(train_set, 'body_mass_g')
  ),
  chains = 3,
  iter   = 300,
  warmup = 180,
  thin   = 1
)

params_1 <- c('beta0', 'beta1', 'sigma')
traceplot(bayesion_model_1, pars = params_1, inc_warmup = TRUE)
```


### 4. Coeficientes

```{r}
lm_vs_br_coeficients(lineal_model_1, bayesion_model_1, params_1)
```

### 4. Validación


```{r}
vars_1 <- c('flipper_length_mm') 

lm_vs_br_models_validation(lineal_model_1, bayesion_model_1, params_1, vars_1, test_set)
```

```{r fig.align='center', fig.height=3, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_predictor_1 <- BayesianRegressionPredictor.from(bayesion_model_1, params_1, vars_1)

plot_compare_fit(
  lineal_model_1, 
  bayesion_predictor_1, 
  train_set,
  label_1 = 'Regresion Lineal', 
  label_2 = 'Regresion Bayesiana'
)
```


## Experimento 2

* Idem a experimento 1, incorporando una variable categórica.
* Regresión múltiple frecuentista.
* Regresión bayesiana con priors normales y exponencial.


### 1. Modelo lineal


```{r}
lineal_model_2 <- lm(
    body_mass_g 
      ~ bill_length_mm
      + bill_depth_mm
      + flipper_length_mm
      + sex,

  data = train_set
)

summary(lineal_model_2)
```

Quitamos los coeficientes que no son significativos para explicar a la variable **body_mass_g**:

```{r}
lineal_model_2 <- lm(
    body_mass_g 
      ~ bill_depth_mm
      + flipper_length_mm
      + sex,
  data = train_set
)

summary(lineal_model_2)
```

### 2.  Modelo bayesiano

Construimos una matriz con todas las variables(X) mas el intercept:

```{r}
to_model_input <- function(df) {
  model.matrix(
    body_mass_g 
      ~ bill_depth_mm
      + flipper_length_mm
      + sex,
    data = df
  )
}

train_X <- train_set %>% to_model_input()
test_X  <- test_set %>% to_model_input()

head(train_X)
```


Definimos y corremos el modelo bayesiano:

```{r fig.align='center', fig.height=4, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_model_2 <- stan(
  model_code = "
    data {
      int<lower=1>                 obs_count;
      int<lower=1>                 coef_count;
      matrix[obs_count,coef_count] X;
      vector[obs_count]            y;
    }
    parameters {
      vector[coef_count]  beta;
      real<lower=0>       sigma;
    }
    model {
      // Distribuciones a priori
      beta[1] ~ normal(0, 3000); // Intecept = beta0 + sexfemale
      beta[2] ~ normal(0, 200);  // bill_depth_mm
      beta[3] ~ normal(0, 200);  // flipper_length_mm
      beta[4] ~ normal(0, 600);  // sexmale
  
      sigma ~ exponential(0.1);  // Varianza
  
      // Likelihood
      y ~ normal(X * beta, sigma);
    }
  ",
  data = list(
      obs_count  = dim(train_X)[1],
      coef_count = dim(train_X)[2],
      y          = colvalues(train_set, 'body_mass_g'),
      X          = train_X
  ),
  chains = 3,
  iter   = 4000,
  warmup = 500,
  thin   = 1
)

params_2 <- c('beta[1]', 'beta[2]', 'beta[3]', 'beta[4]', 'sigma')
traceplot(bayesion_model_2, inc_warmup = TRUE, pars = params_2)
```


### 3. Coeficientes

```{r}
lineal_model_2$coefficients
```

```{r}
br_coeficients(bayesion_model_2, params_2)
```

### 4. Validación

```{r}
vars_2 <- c('bill_depth_mm', 'flipper_length_mm', 'sexmale')

lm_vs_br_models_validation(
  lineal_model_2, 
  bayesion_model_2, 
  params_2,
  vars_2,
  test_set,
  as.data.frame(test_X)
)
```

```{r fig.align='center', fig.height=3, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_predictor_2 <- BayesianRegressionPredictor.from(bayesion_model_2, params_2, vars_2)

plot_compare_fit(
  lineal_model_2, 
  bayesion_predictor_2, 
  test_set, 
  as.data.frame(test_X),
  label_1 = 'Regresion Lineal', 
  label_2 = 'Regresion Bayesiana'
)
```




## Experimento 3

* Idem a experimento 1 incorporando outliers en alguna variable numérica.
* Regresión múltiple frecuentista.
* Regresión bayesiana con priors normales y exponencial.


### 1. Outliers

A continuación vamos a agregar outliers a la variable **flipper_length_mm**, la cual define la longitud de la aleta del individuo medida en milímetros. 

Para visualizar los nuevos outliers a continuación graficamos **flipper_length_mm** vs.  **body_mass_g** antes y después de agregar outliers:

```{r fig.align='center', fig.height=3, fig.width=5, message=FALSE, warning=FALSE, fig.align='center'}
plot_data(train_set)
```

```{r fig.align='center', fig.height=3, fig.width=5, message=FALSE, warning=FALSE, fig.align='center'}
train_set_with_outliers <- train_set %>%
  mutate(flipper_length_mm = ifelse(
    body_mass_g > 5400 & body_mass_g < 5700, 
    flipper_length_mm + (flipper_length_mm * runif(1, 0.1, 0.2)), 
    flipper_length_mm
  ))

plot_data(train_set_with_outliers)
```

### 2. Modelo lineal

```{r}
lineal_model_3 <- lm(
  body_mass_g ~ flipper_length_mm,
  data = train_set_with_outliers
)
summary(lineal_model_3)
```


Comparemos el ajuste del modelo lineal ajustando en un dataset de train con y sin outliers:

```{r fig.align='center', fig.height=3, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
plot_compare_fit(
  lineal_model_1, 
  lineal_model_3, 
  train_set_with_outliers,
  label_1='Regresión Lineal SIN outliers', 
  label_2='Regresión Lineal CON outliters'
)
```
 
Se aprecia que el modelo entrenado en **train_set_outliers** esta apalancado por las observaciones atípicas

### 2. Modelo lineal Robusto

Entrenamos una regresión lineal múltiple robusta para intentar de disminuir el efecto de los nuevos outliers:

```{r}
robust_lineal_model_3 <- rlm(
  body_mass_g ~ flipper_length_mm,
  data = train_set_with_outliers
)
```


```{r fig.align='center', fig.height=3, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
plot_compare_fit(
  lineal_model_1, 
  robust_lineal_model_3, 
  train_set_with_outliers,
  label_1='Regresión Lineal SIN outliers', 
  label_2='Regresión Lineal Robusta CON outliters'
)
```

Gráficamente no se llega a distinguir pero el modelo robusto termina ajustando mejor que el modelo lineal clásico. Esto se puede observar cuando comparamos el RMSE/MAE sobre train y test.


```{r}
lm_vs_lm_models_validation(lineal_model_3, robust_lineal_model_3, train_set_with_outliers)
```


```{r}
lm_vs_lm_models_validation(lineal_model_3, robust_lineal_model_3, test_set)
```


### 3. Modelo bayesiano con Likelihood normal


```{r fig.align='center', fig.height=4, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_model_3 <- stan(
  model_code =  "
    data {
      int<lower=1>               obs_count;
      vector<lower=1>[obs_count] x;
      vector[obs_count]          y;
    }
    parameters {
      real          beta0;
      real          beta1;
      real<lower=0> sigma;
    }
    model {
      // Distribuciones a priori (Fijado por el investigador)
      beta0 ~ normal(0, 1000);  // Intercept
      beta1 ~ normal(0, 100);   // flipper_length_mm
      sigma ~ exponential(0.1); // Varianza

      // Likelihood
      y ~ normal(beta0 + beta1 * x, sigma);
    }
  ",
  data = list(
      obs_count = nrow(train_set_with_outliers),
      x = colvalues(train_set_with_outliers, 'flipper_length_mm'),
      y = colvalues(train_set_with_outliers, 'body_mass_g')
  ),
  chains = 3,
  iter   = 1000,
  warmup = 300,
  thin   = 1
)

params_3 <- c('beta0', 'beta1', 'sigma')
traceplot(bayesion_model_3, pars = params_3, inc_warmup = TRUE)
```


### 4. Coeficientes

```{r}
lm_vs_br_coeficients(robust_lineal_model_3, bayesion_model_3, params_3)
```

### 5. Validación


```{r}
vars_3 <- c('flipper_length_mm') 

lm_vs_br_models_validation(robust_lineal_model_3, bayesion_model_3, params_3, vars_3, test_set)
```

```{r fig.align='center', fig.height=3, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_predictor_3 <- BayesianRegressionPredictor.from(bayesion_model_3, params_3, vars_3)

plot_compare_fit(
  robust_lineal_model_3, 
  bayesion_predictor_3, 
  train_set,
  label_1='Regresion Lineal Robusta CON outliers', 
  label_2='Regresion Bayesiana CON outliers'
)
```


### 6. Modelo bayesiano con Likelihood t-student


```{r fig.align='center', fig.height=4, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_model_3.1 <- stan(
  model_code =  "
    data {
      int<lower=1>               obs_count;
      vector<lower=1>[obs_count] x;
      vector[obs_count]          y;
    }
    parameters {
      real          beta0;
      real          beta1;
      real<lower=0> sigma;
    }
    model {
      // Distribuciones a priori (Fijado por el investigador)
      beta0 ~ normal(-5700, 50);  // Intercept (Muy informativa!!!)
      beta1 ~ normal(49, 20);      // flipper_length_mm (Muy informativa!!!)
      sigma ~ exponential(0.1); // Varianza

      // Likelihood
      y ~ student_t(2, beta0 + beta1 * x, sigma);
    }
  ",
  data = list(
      obs_count = nrow(train_set_with_outliers),
      x = colvalues(train_set_with_outliers, 'flipper_length_mm'),
      y = colvalues(train_set_with_outliers, 'body_mass_g')
  ),
  chains = 3,
  iter   = 1000,
  warmup = 300,
  thin   = 1
)

params_3 <- c('beta0', 'beta1', 'sigma')
traceplot(bayesion_model_3.1, pars = params_3, inc_warmup = TRUE)
```

### 7. Coeficientes

```{r}
lm_vs_br_coeficients(lineal_model_1, bayesion_model_3.1, params_3)
```

### 8. Validación


```{r}
vars_3 <- c('flipper_length_mm') 

lm_vs_br_models_validation(lineal_model_1, bayesion_model_3.1, params_3, vars_3, test_set)
```

```{r fig.align='center', fig.height=3, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_predictor_3.1 <- BayesianRegressionPredictor.from(bayesion_model_3.1, params_3, vars_3)

plot_compare_fit(
  lineal_model_1, 
  bayesion_predictor_3.1, 
  train_set_with_outliers,
  label_1='Regresion Lineal SIN outliers', 
  label_2='Regresion Bayesiana CON outliers'
)
```

## Experimento 4


* Idem experimento 1 pero reduciendo la cantidad de observaciones a pocos valores (ej:30).
* Regresión múltiple frecuentista.
* Regresión bayesiana con priors normales y exponencial.


### 1. Split train - test

En este aso entrenamos solo con el 10% de lo datos.

```{r message=FALSE, warning=FALSE}
train_test <- train_test_split(dataset, train_size = 0.05, shuffle = TRUE)
train_set_4 <- train_test[[1]]
test_set_4  <- train_test[[2]]
```

```{r fig.align='center', fig.height=3, fig.width=5, message=FALSE, warning=FALSE, fig.align='center'}
plot_data(train_set_4)
```

### 2. Modelo lineal

```{r}
lineal_model_4 <- lm(
  body_mass_g ~ flipper_length_mm,
  data = train_set_4
)
```

### 3. Modelo bayesiano

```{r fig.align='center', fig.height=4, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_model_4 <- stan(
  model_code =  "
    data {
      int<lower=1>               obs_count;
      vector<lower=1>[obs_count] x;
      vector[obs_count]          y;
    }
    parameters {
      real          beta0;
      real          beta1;
      real<lower=0> sigma;
    }
    model {
      // Distribuciones a priori
      beta0 ~ normal(0, 8000); // Intercept
      beta1 ~ normal(0, 100);
      sigma ~ exponential(0.1);
    
      // Likelihood
      y ~ normal(beta0 + beta1 * x, sigma);
    }
  ",
  data = list(
      obs_count = nrow(train_set_4),
      x = colvalues(train_set_4, 'flipper_length_mm'),
      y  = colvalues(train_set_4, 'body_mass_g')
  ),
  chains = 3,
  iter   = 300,
  warmup = 180,
  thin   = 1
)

params_4 <- c('beta0', 'beta1', 'sigma')
traceplot(bayesion_model_4, pars = params_4, inc_warmup = TRUE)
```

### 4. Coeficientes

Coeficientes de la regresión múltiple:


```{r}
lineal_model_4$coefficients
```

Coeficientes descubiertos por la regresión múltiple bayesiana:

```{r}
for(param in params_4) print(get_posterior_mean(bayesion_model_4, par=param)[4])
```


### 5. Validación


```{r}
vars_4 <- c('flipper_length_mm') 

lm_vs_br_models_validation(lineal_model_4, bayesion_model_4, params_4, vars_4, test_set)
```

```{r fig.align='center', fig.height=3, fig.width=5, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_predictor_4 <- BayesianRegressionPredictor.from(bayesion_model_4, params_4, vars_4)

plot_compare_fit(
  lineal_model_4,
  bayesion_predictor_4, 
  train_set,
  label_1='Regresion Lineal', 
  label_2='Regresion Bayesiana'
)
```


## Experimento 5

* Igual al experimento 1 pero proponiendo dos nuevas regresiones bayesianas con priors para los parámetros que sean:
  * Una poca informativa (uniforme).
  * Una muy informativa (sesgada o con muy poca varianza).
* Comparar con resultados de la bayesiana del experimento A

### 1. Modelo bayesiano con parametro con distribucion poco informativa

#### 1. Modelo

Definimos una [distribución uniforme](https://mc-stan.org/docs/2_21/functions-reference/uniform-distribution.html) para el beta asociado a la variable **flipper_length_mm**.

```{r fig.align='center', fig.height=4, fig.width=8, message=FALSE, warning=FALSE, fig.align='center', echo=FALSE}
bayesion_model_5 <- stan(
  model_code =  "
    data {
      int<lower=1>               obs_count;
      vector<lower=1>[obs_count] x;
      vector[obs_count]          y;
    }
    parameters {
      real          beta0;
      real          beta1;
      real<lower=0> sigma;
    }
    model {
      // Distribuciones a priori
      beta0 ~ normal(0, 8000); // Intercept
      beta1 ~ normal(0, 100);
      sigma ~ exponential(0.5);
  
      // Likelihood
      y ~ normal(beta0 + beta1 * x, sigma);
    }
  ",
  data = list(
      obs_count = nrow(train_set),
      x = colvalues(train_set, 'flipper_length_mm'),
      y  = colvalues(train_set, 'body_mass_g')
  ),
  chains = 3,
  iter   = 1000,
  warmup = 180,
  thin   = 1
)

params_5 <- c('beta0', 'beta1', 'sigma')
traceplot(bayesion_model_5, pars = params_5, inc_warmup = TRUE)
```


#### 2. Coeficientes


```{r}
br_vs_br_coeficients(bayesion_model_1, bayesion_model_5, params_5)
```


#### 3. Validación

```{r}
lm_vs_br_models_validation(lineal_model_1, bayesion_model_1, params_1, vars_1, test_set)
```
#### 4. Validacion

```{r}
vars_5 <- c('flipper_length_mm') 

lm_vs_br_models_validation(lineal_model_1, bayesion_model_5, params_5, vars_5, test_set)
```

```{r fig.align='center', fig.height=3, fig.width=10, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_predictor_5 <- BayesianRegressionPredictor.from(bayesion_model_5, params_5, vars_5)

plot_compare_fit(
  bayesion_predictor_1,
  bayesion_predictor_5,
  train_set,
  label_1='Regresion Bayesiana con dist informativa', 
  label_2='Regresion Bayesiana con dist menos informativa'
)
```

### 2. Modelo bayesiano con parametro con distribucion muy informativa sesgada o con poca varianza

COMPLETAR



## Referencias

* [Making Predictions from Stan models in R](https://medium.com/@alex.pavlakis/making-predictions-from-stan-models-in-r-3e349dfac1ed)
* [How to represent a categorical predictor rstan?](https://stackoverflow.com/questions/29183577/how-to-represent-a-categorical-predictor-rstan)

